library(faraway) # Load faraway package to use data

Q1.

a.

# Creating boxplot for the processes (treatments):
plot(yield ~ treat, penicillin, xlab = "Process", ylab = "Yield",
     main = "Production of Penicillin for the 4 Processes")

# Creating boxplot for the blends (blocks):
plot(yield ~ blend, penicillin, xlab = "Blend", ylab = "Yield",
     main = "Production of Penicillin for the 5 Blends")

# Creating interaction plot with processes (treatments) as the x-axis:
with(penicillin, interaction.plot(treat, blend, yield, xlab = "Process", ylab = "Yield",
                                  main = "Production of Penicillin by the 4 Processes"))

# Creating interaction plot with blends (blocks) as the x-axis:
with(penicillin, interaction.plot(blend, treat, yield, xlab = "Blend", ylab = "Yield",
                                  main = "Production of Penicillin by the 5 Blends"))

b.

# Fitting linear model:
penicillin.lm <- lm(yield ~ treat + blend, data = penicillin)
summary(penicillin.lm) # Get results
## 
## Call:
## lm(formula = yield ~ treat + blend, data = penicillin)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5.00  -2.25  -0.50   2.25   6.00 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   90.000      2.745  32.791  4.1e-13 ***
## treatB         1.000      2.745   0.364  0.72194    
## treatC         5.000      2.745   1.822  0.09351 .  
## treatD         2.000      2.745   0.729  0.48018    
## blendBlend2   -9.000      3.069  -2.933  0.01254 *  
## blendBlend3   -7.000      3.069  -2.281  0.04159 *  
## blendBlend4   -4.000      3.069  -1.304  0.21686    
## blendBlend5  -10.000      3.069  -3.259  0.00684 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.34 on 12 degrees of freedom
## Multiple R-squared:  0.5964, Adjusted R-squared:  0.361 
## F-statistic: 2.534 on 7 and 12 DF,  p-value: 0.07535
# Testing each variable with respect to the full linear model:
drop1(penicillin.lm, test = "F")
## Single term deletions
## 
## Model:
## yield ~ treat + blend
##        Df Sum of Sq RSS    AIC F value  Pr(>F)  
## <none>              226 64.496                  
## treat   3        70 296 63.893  1.2389 0.33866  
## blend   4       264 490 71.973  3.5044 0.04075 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The treatment effect has a p-value of 0.33866 (large p-value), which is greater than the 5% significance level (alpha = 0.05) so therefore there is strong evidence suggesting that the treatment effect is not statistically significant. Hence, I conclude that there is not a significant difference between the treatments.

c.

# Testing each variable with respect to the full linear model:
drop1(penicillin.lm, test = "F")
## Single term deletions
## 
## Model:
## yield ~ treat + blend
##        Df Sum of Sq RSS    AIC F value  Pr(>F)  
## <none>              226 64.496                  
## treat   3        70 296 63.893  1.2389 0.33866  
## blend   4       264 490 71.973  3.5044 0.04075 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Fitting linear model without the blend (blocking) effect:
penicillin.lm.crd <- lm(yield ~ treat, data = penicillin)
summary(penicillin.lm.crd) # Get results
## 
## Call:
## lm(formula = yield ~ treat, data = penicillin)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -9.00  -3.25   0.00   3.00   8.00 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   84.000      2.475  33.941 2.45e-16 ***
## treatB         1.000      3.500   0.286    0.779    
## treatC         5.000      3.500   1.429    0.172    
## treatD         2.000      3.500   0.571    0.576    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.534 on 16 degrees of freedom
## Multiple R-squared:  0.125,  Adjusted R-squared:  -0.03906 
## F-statistic: 0.7619 on 3 and 16 DF,  p-value: 0.5318
summary(penicillin.lm.crd)$sig # Get variance for completely randomized design
## [1] 5.533986
summary(penicillin.lm)$sig # Get variance for randomized complete block design
## [1] 4.339739
(5.533986/4.339739)^2 # Computing relative efficiency
## [1] 1.626106

The blend (blocking) effect has a p-value of 0.04075 (small p-value), which is less than the 5% significance level (alpha = 0.05) so therefore I conclude that the blend effect is statistically significant and hence there is a difference between the blends.

Using blends in the design and modeling improved the outcome because the randomized complete block design is 1.626106 times more efficient than the completely randomized design, which means the precision of the outcome (yield (production of penicillin)) increased by approximately 63%. The efficiency that was gained by the blocked design is approximately 63% (blends included in the design and modeling). Therefore, the completely randomized design (blends not included in the design and modeling) needs about 63% more observations to obtain the same level of precision as the blocked design (blends included in the design and modeling), which is why using blends in the design and modeling improved the outcome (yield (production of penicillin)).

d.

# Creating normal Q-Q plot of residuals:
qqnorm(residuals(penicillin.lm))
qqline(residuals(penicillin.lm)) # Adding Q-Q line

# Creating residuals vs. fitted values plot:
plot(fitted(penicillin.lm), residuals(penicillin.lm), xlab = "Fitted",
     ylab = "Residuals", main = "Residuals vs. Fitted Values")
abline(h=0) # Adding horizontal line at 0

# Shapiro-Wilk normality test if the residuals are normal:
shapiro.test(residuals(penicillin.lm))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(penicillin.lm)
## W = 0.95047, p-value = 0.3743
# Test for a difference in the variance:
penicillin$residuals <- sqrt(abs(residuals(penicillin.lm))) # Square-root of residuals
penicillin.lm.var <- lm(residuals ~ treat + blend, penicillin) # Creating new linear model
anova(penicillin.lm.var) # ANOVA test
## Analysis of Variance Table
## 
## Response: residuals
##           Df Sum Sq Mean Sq F value Pr(>F)
## treat      3 0.6261 0.20869  0.3856 0.7654
## blend      4 2.4884 0.62211  1.1495 0.3802
## Residuals 12 6.4943 0.54119
library(lmtest) # Load lmtest library to use the dwtest function

# Durbin-Watson test for correlated errors:
dwtest(yield ~ treat + blend, data = penicillin)
## 
##  Durbin-Watson test
## 
## data:  yield ~ treat + blend
## DW = 2.5973, p-value = 0.6853
## alternative hypothesis: true autocorrelation is greater than 0

When looking at the normal Q-Q plot, most of the residuals approximately lie on the Q-Q line so therefore the residuals are approximately normal. Also, the p-value from the Shapiro-Wilk normality test is 0.3743 (large p-value), which is greater than the 5% significance level (alpha = 0.05) so therefore I do not reject the null hypothesis that the residuals are normal and conclude that there is strong evidence suggesting the residuals are normal. Hence, the normality assumption has been satisfied.

In addition, in the residuals vs. fitted values plot the residuals seem to be evenly spread out around the horizontal line at 0 and form a horizontal band around the horizontal line at 0, which suggests that the error variances are homogeneous to one another. In addition, the p-value from the ANOVA test for the treatments is 0.7654 and the p-value for the blends is 0.3802, both of the p-values are large, which allows for the conclusion that there is no strong evidence against of constant variance for both the treatments and blends. Hence, the assumption of constant variance is satisfied.

There is no symmetry in the residuals vs. fitted values plot, which suggests that the residuals are independent of one another. The p-value obtained from the Durbin-Watson test is 0.6853 (large p-value), which is greater than the 5% significance level (alpha = 0.05) so therefore I do not reject the null hypothesis that the errors are uncorrelated and conclude that the there is strong evidence of the errors being uncorrelated. Hence, the independence assumption is satisfied.

Q2.

a.

library(ggplot2) # Loading ggplot2 library to use ggplot function to make plots

# Creating Boxplots of butterfat data:
# Boxplot of Butterfat with Breed as the x-axis
plot(Butterfat ~ Breed, butterfat, main = "Butterfat Data Plot by Breed")

# Boxplot Butterfat with Age as the x-axis
plot(Butterfat ~ Age, butterfat, main = "Butterfat Data Plot by Age")

# Simple plots:
# Plotting Butterfat with Breed as the x-axis
with(butterfat, interaction.plot(Breed, Age, Butterfat, ylab = "Butterfat",
                                 main = "Butterfat Data Plot by Breed"))

# Plotting Butterfat with Age as the x-axis
with(butterfat, interaction.plot(Age, Breed, Butterfat, ylab = "Butterfat",
                                 main = "Butterfat Data Plot by Age"))

# Complex plots:
# Plotting Butterfat with Breed as the x-axis and with some horizontal jittering
ggplot(butterfat, aes(x = Breed, y = Butterfat, shape = Age)) +
  geom_point(position = position_jitter(width = 0.1)) +
  stat_summary(fun = "mean", geom = "line", aes(group = Age, linetype = Age)) +
  theme(legend.position = "top", legend.direction = "horizontal") +
  ggtitle("Butterfat Data Plot by Breed")

# Plotting Butterfat with Age as the x-axis and with some horizontal jittering
ggplot(butterfat, aes(x = Age, y = Butterfat, shape = Breed)) +
  geom_point(position = position_jitter(width = 0.1)) +
  stat_summary(fun = "mean", geom = "line", aes(group = Breed, linetype = Breed)) +
  theme(legend.position = "top", legend.direction = "horizontal") +
  ggtitle("Butterfat Data Plot by Age")

b.

# Fitting linear model:
butterfat.lm <- lm(Butterfat ~ Breed*Age, data = butterfat)
summary(butterfat.lm) # Get results
## 
## Call:
## lm(formula = Butterfat ~ Breed * Age, data = butterfat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0190 -0.2720 -0.0430  0.2372  1.3170 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       3.9660     0.1316  30.143  < 2e-16 ***
## BreedCanadian                     0.5220     0.1861   2.805  0.00616 ** 
## BreedGuernsey                     0.9330     0.1861   5.014 2.65e-06 ***
## BreedHolstein-Fresian            -0.3030     0.1861  -1.628  0.10693    
## BreedJersey                       1.1670     0.1861   6.272 1.22e-08 ***
## AgeMature                         0.1880     0.1861   1.010  0.31503    
## BreedCanadian:AgeMature          -0.2870     0.2631  -1.091  0.27834    
## BreedGuernsey:AgeMature          -0.0860     0.2631  -0.327  0.74457    
## BreedHolstein-Fresian:AgeMature  -0.1750     0.2631  -0.665  0.50773    
## BreedJersey:AgeMature             0.1310     0.2631   0.498  0.61982    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4161 on 90 degrees of freedom
## Multiple R-squared:  0.6926, Adjusted R-squared:  0.6619 
## F-statistic: 22.53 on 9 and 90 DF,  p-value: < 2.2e-16
# ANOVA test if the interaction effect is significant:
anova(butterfat.lm)
## Analysis of Variance Table
## 
## Response: Butterfat
##           Df Sum Sq Mean Sq F value Pr(>F)    
## Breed      4 34.321  8.5803 49.5651 <2e-16 ***
## Age        1  0.274  0.2735  1.5801 0.2120    
## Breed:Age  4  0.514  0.1285  0.7421 0.5658    
## Residuals 90 15.580  0.1731                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value obtained for the interaction term between breed and age from the ANOVA test is 0.5658 (large p-value), which is greater than the 5% significance level (alpha = 0.05) so therefore there is a significant interaction effect between breed and age. Hence, there is an interaction between breed and age.

c.

# Fitting new linear model without the interaction term between breed and age:
butterfat.lm2 <- lm(Butterfat ~ Breed + Age, data = butterfat)
summary(butterfat.lm2) # Get results
## 
## Call:
## lm(formula = Butterfat ~ Breed + Age, data = butterfat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0202 -0.2373 -0.0640  0.2617  1.2098 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            4.00770    0.10135  39.541  < 2e-16 ***
## BreedCanadian          0.37850    0.13085   2.893  0.00475 ** 
## BreedGuernsey          0.89000    0.13085   6.802 9.48e-10 ***
## BreedHolstein-Fresian -0.39050    0.13085  -2.984  0.00362 ** 
## BreedJersey            1.23250    0.13085   9.419 3.16e-15 ***
## AgeMature              0.10460    0.08276   1.264  0.20937    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4138 on 94 degrees of freedom
## Multiple R-squared:  0.6825, Adjusted R-squared:  0.6656 
## F-statistic: 40.41 on 5 and 94 DF,  p-value: < 2.2e-16
# ANOVA test if there is a statistically significant difference for breeds and for ages:
anova(butterfat.lm2)
## Analysis of Variance Table
## 
## Response: Butterfat
##           Df Sum Sq Mean Sq F value Pr(>F)    
## Breed      4 34.321  8.5803 50.1150 <2e-16 ***
## Age        1  0.274  0.2735  1.5976 0.2094    
## Residuals 94 16.094  0.1712                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Computing Tukey's HSD 95% Confidence Intervals for Breed and Age:
TukeyHSD(aov(Butterfat ~ Breed + Age, butterfat))
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Butterfat ~ Breed + Age, data = butterfat)
## 
## $Breed
##                              diff        lwr        upr     p adj
## Canadian-Ayrshire          0.3785  0.0145538  0.7424462 0.0373310
## Guernsey-Ayrshire          0.8900  0.5260538  1.2539462 0.0000000
## Holstein-Fresian-Ayrshire -0.3905 -0.7544462 -0.0265538 0.0290906
## Jersey-Ayrshire            1.2325  0.8685538  1.5964462 0.0000000
## Guernsey-Canadian          0.5115  0.1475538  0.8754462 0.0016067
## Holstein-Fresian-Canadian -0.7690 -1.1329462 -0.4050538 0.0000006
## Jersey-Canadian            0.8540  0.4900538  1.2179462 0.0000000
## Holstein-Fresian-Guernsey -1.2805 -1.6444462 -0.9165538 0.0000000
## Jersey-Guernsey            0.3425 -0.0214462  0.7064462 0.0752825
## Jersey-Holstein-Fresian    1.6230  1.2590538  1.9869462 0.0000000
## 
## $Age
##                diff         lwr       upr     p adj
## Mature-2year 0.1046 -0.05971342 0.2689134 0.2093695

The p-value obtained from the ANOVA test for Breed is <2e-16 (small p-value), which is less than the 5% significance level (alpha = 0.05) so therefore I reject the null hypothesis that all breeds have equal mean butterfat and conclude that there is a statistically significant difference between breeds.

The pairwise differences that are not statistically significant is only Jersey and Guernsey because its corresponding confidence interval contains 0 and also its corresponding p-value (p-value = 0.0752825 is large) is greater than the 5% significance level (alpha = 0.05). The rest of the pairwise differences are statistically significant, which are Canadian and Ayrshire, Guernsey and Ayrshire, Holstein-Fresian and Ayrshire, Jersey and Ayrshire, Guernsey and Canadian, Holstein-Fresian and Canadian, Jersey and Canadian, Holstein-Fresian and Guernsey, and Jersey and Holstein-Fresian.

The p-value obtained from the ANOVA test for Age is 0.2094 (large p-value), which is greater than the 5% significance level (alpha = 0.05) so therefore I do not reject the null hypothesis that all ages have equal mean butterfat and conclude that there is not a statistically significant difference between ages.

The pairwise difference Mature and 2year is not statistically significant because its corresponding confidence interval contains 0 and also its corresponding p-value (p-value = 0.2093695 is large) is greater than the 5% significance level (alpha = 0.05).

d.

# Fitting the new chosen linear model without the Age predictor:
butterfat.lm3 <- lm(Butterfat ~ Breed, data = butterfat)
summary(butterfat.lm3) # Get results
## 
## Call:
## lm(formula = Butterfat ~ Breed, data = butterfat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.07250 -0.27213 -0.05125  0.22363  1.25750 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            4.06000    0.09281  43.743  < 2e-16 ***
## BreedCanadian          0.37850    0.13126   2.884  0.00486 ** 
## BreedGuernsey          0.89000    0.13126   6.780 1.01e-09 ***
## BreedHolstein-Fresian -0.39050    0.13126  -2.975  0.00371 ** 
## BreedJersey            1.23250    0.13126   9.390 3.33e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4151 on 95 degrees of freedom
## Multiple R-squared:  0.6771, Adjusted R-squared:  0.6635 
## F-statistic:  49.8 on 4 and 95 DF,  p-value: < 2.2e-16
# Creating normal Q-Q plot of residuals:
qqnorm(residuals(butterfat.lm3))
qqline(residuals(butterfat.lm3)) # Adding Q-Q line

# Creating residuals vs. fitted values plot:
plot(fitted(butterfat.lm3), residuals(butterfat.lm3), xlab = "Fitted",
     ylab = "Residuals", main = "Residuals vs. Fitted Values")
abline(h=0) # Adding horizontal line at 0

# Creating boxplot of residuals for the breeds:
plot(residuals(butterfat.lm3) ~ Breed, butterfat, ylab = "Residuals",
     main = "Butterfat Content of Milk Residuals from 5 Different Breeds")

# Shapiro-Wilk normality test if the residuals are normal:
shapiro.test(residuals(butterfat.lm3))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(butterfat.lm3)
## W = 0.96805, p-value = 0.01571
# Test for a difference in the variance
butterfat$residuals <- sqrt(abs(residuals(butterfat.lm3))) # Square-root of residuals
butterfat.lm3.var <- lm(residuals ~ Breed, butterfat) # Creating new linear model
anova(butterfat.lm3.var) # ANOVA test
## Analysis of Variance Table
## 
## Response: residuals
##           Df Sum Sq  Mean Sq F value  Pr(>F)  
## Breed      4 0.6767 0.169183  3.2675 0.01479 *
## Residuals 95 4.9189 0.051778                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Bartlett test:
bartlett.test(Butterfat ~ Breed, butterfat)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  Butterfat by Breed
## Bartlett's K-squared = 20.251, df = 4, p-value = 0.0004455
# Durbin-Watson test for correlated errors:
dwtest(Butterfat ~ Breed, data = butterfat)
## 
##  Durbin-Watson test
## 
## data:  Butterfat ~ Breed
## DW = 2.0798, p-value = 0.4996
## alternative hypothesis: true autocorrelation is greater than 0
# Fitting the new linear model with log transformation:
butterfat.lm4 <- lm(log(Butterfat) ~ Breed, data = butterfat)
summary(butterfat.lm4) # Get results
## 
## Call:
## lm(formula = log(Butterfat) ~ Breed, data = butterfat)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.238995 -0.061397 -0.005733  0.051988  0.219157 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            1.39919    0.01939  72.173  < 2e-16 ***
## BreedCanadian          0.08798    0.02742   3.209 0.001816 ** 
## BreedGuernsey          0.19564    0.02742   7.136 1.90e-10 ***
## BreedHolstein-Fresian -0.10139    0.02742  -3.698 0.000364 ***
## BreedJersey            0.26112    0.02742   9.524 1.72e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0867 on 95 degrees of freedom
## Multiple R-squared:  0.7046, Adjusted R-squared:  0.6922 
## F-statistic: 56.65 on 4 and 95 DF,  p-value: < 2.2e-16
# Creating normal Q-Q plot of residuals:
qqnorm(residuals(butterfat.lm4))
qqline(residuals(butterfat.lm4)) # Adding Q-Q line

# Creating residuals vs. fitted values plot:
plot(fitted(butterfat.lm4), residuals(butterfat.lm4), xlab = "Fitted",
     ylab = "Residuals", main = "Residuals vs. Fitted Values")
abline(h=0) # Adding horizontal line at 0

# Creating boxplot of residuals for the breeds:
plot(residuals(butterfat.lm4) ~ Breed, butterfat, ylab = "Residuals",
     main = "Butterfat Content of Milk Residuals from 5 Different Breeds")

When looking at the boxplots of the different breeds, the Holstein-Fresian has 1 outlier, whereas the other breeds (Ayrshire, Canadian, Guernsey, and Jersey) have no outliers. The median for the Holstein-Fresian breed is close to its lower quartile. There is positive skewness for the breed Holstein-Fresian, and somewhat negative skewness for the breed Ayrshire. This suggests that there is a lack of normality in the data. Also, the equality of variance between the breeds does not seem to be satisfied.

When looking at the normal Q-Q plot, most of the residuals approximately lie on the Q-Q line, but some of the residuals fall off of the Q-Q line near the end, which suggests the errors are short-tailed. Also, the p-value from the Shapiro-Wilk normality test is 0.01571 (small p-value), which is less than the 5% significance level (alpha = 0.05) so therefore I reject the null hypothesis that the residuals are normal and conclude that there is strong evidence suggesting the residuals are not normal. Hence, the normality assumption has not been met.

In addition, in the residuals vs. fitted values plot the residuals seem to not be evenly spread out around the horizontal line at 0 and do not form a horizontal band around the horizontal line at 0, which suggests that the error variances are not homogeneous to one another. Also, in the residuals vs. fitted values plot the variation is increasing as the response increases. In addition, the p-value from the ANOVA test for Breed is 0.01479 and the p-value from the Bartlett test is 0.0004455, both the p-values are small, which allows for the conclusion that there is strong evidence of non-constant variance. Hence, the assumption of constant variance is not satisfied.

There is no symmetry in the residuals vs. fitted values plot, which suggests that the residuals are independent of one another. The p-value obtained from the Durbin-Watson test is 0.4996 (large p-value), which is greater than the 5% significance level (alpha = 0.05) so therefore I do not reject the null hypothesis that the errors are uncorrelated and conclude that the there is strong evidence of the errors being uncorrelated. Hence, the independence assumption is satisfied.

All together this suggests that a transformation of the response is likely needed. Hence, a log transformation of the response could be used. After using the log transformation on the response, from the new Q-Q plot the normality assumption seems to be met since most of the residuals lie on the Q-Q line and the constant variance assumption seems to be met since the residuals form a horizontal band around the horizontal line at 0 in the new residuals vs. fitted values plot. Also, there is no symmetry in the residuals vs. fitted values plot, which suggests that the residuals are independent of one another. Hence, the assumptions have been meant under the log transformation of the response.

e.

# Without transformation of the response:
summary(butterfat.lm3) # Get results
## 
## Call:
## lm(formula = Butterfat ~ Breed, data = butterfat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.07250 -0.27213 -0.05125  0.22363  1.25750 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            4.06000    0.09281  43.743  < 2e-16 ***
## BreedCanadian          0.37850    0.13126   2.884  0.00486 ** 
## BreedGuernsey          0.89000    0.13126   6.780 1.01e-09 ***
## BreedHolstein-Fresian -0.39050    0.13126  -2.975  0.00371 ** 
## BreedJersey            1.23250    0.13126   9.390 3.33e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4151 on 95 degrees of freedom
## Multiple R-squared:  0.6771, Adjusted R-squared:  0.6635 
## F-statistic:  49.8 on 4 and 95 DF,  p-value: < 2.2e-16
# Computing Tukey's HSD 95% Confidence Intervals:
TukeyHSD(aov(Butterfat ~ Breed, butterfat))
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Butterfat ~ Breed, data = butterfat)
## 
## $Breed
##                              diff         lwr         upr     p adj
## Canadian-Ayrshire          0.3785  0.01348598  0.74351402 0.0381804
## Guernsey-Ayrshire          0.8900  0.52498598  1.25501402 0.0000000
## Holstein-Fresian-Ayrshire -0.3905 -0.75551402 -0.02548598 0.0297910
## Jersey-Ayrshire            1.2325  0.86748598  1.59751402 0.0000000
## Guernsey-Canadian          0.5115  0.14648598  0.87651402 0.0016669
## Holstein-Fresian-Canadian -0.7690 -1.13401402 -0.40398598 0.0000007
## Jersey-Canadian            0.8540  0.48898598  1.21901402 0.0000000
## Holstein-Fresian-Guernsey -1.2805 -1.64551402 -0.91548598 0.0000000
## Jersey-Guernsey            0.3425 -0.02251402  0.70751402 0.0767002
## Jersey-Holstein-Fresian    1.6230  1.25798598  1.98801402 0.0000000
# With the log transformation of the response:
summary(butterfat.lm4) # Get results
## 
## Call:
## lm(formula = log(Butterfat) ~ Breed, data = butterfat)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.238995 -0.061397 -0.005733  0.051988  0.219157 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            1.39919    0.01939  72.173  < 2e-16 ***
## BreedCanadian          0.08798    0.02742   3.209 0.001816 ** 
## BreedGuernsey          0.19564    0.02742   7.136 1.90e-10 ***
## BreedHolstein-Fresian -0.10139    0.02742  -3.698 0.000364 ***
## BreedJersey            0.26112    0.02742   9.524 1.72e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0867 on 95 degrees of freedom
## Multiple R-squared:  0.7046, Adjusted R-squared:  0.6922 
## F-statistic: 56.65 on 4 and 95 DF,  p-value: < 2.2e-16
# Computing Tukey's HSD 95% Confidence Intervals:
TukeyHSD(aov(log(Butterfat) ~ Breed, butterfat))
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = log(Butterfat) ~ Breed, data = butterfat)
## 
## $Breed
##                                  diff         lwr         upr     p adj
## Canadian-Ayrshire          0.08798426  0.01174182  0.16422671 0.0152636
## Guernsey-Ayrshire          0.19564150  0.11939906  0.27188395 0.0000000
## Holstein-Fresian-Ayrshire -0.10139038 -0.17763283 -0.02514793 0.0032753
## Jersey-Ayrshire            0.26111862  0.18487617  0.33736107 0.0000000
## Guernsey-Canadian          0.10765724  0.03141479  0.18389969 0.0015028
## Holstein-Fresian-Canadian -0.18937464 -0.26561709 -0.11313219 0.0000000
## Jersey-Canadian            0.17313436  0.09689191  0.24937681 0.0000001
## Holstein-Fresian-Guernsey -0.29703188 -0.37327433 -0.22078943 0.0000000
## Jersey-Guernsey            0.06547712 -0.01076533  0.14171957 0.1275040
## Jersey-Holstein-Fresian    0.36250900  0.28626655  0.43875145 0.0000000

In both cases (with and without transformation of the response), the best breed in terms of butterfat content is Jersey and the second best breed in terms of butterfat content is Guernsey. However in both cases, Jersey and Guernsey are not significantly different from each other because its pairwise difference has a corresponding confidence interval that contains 0 and also its corresponding p-value (large p-value) is greater than the 5% significance level (alpha = 0.05). Hence, the best breed in terms of butterfat content Jersey is not clearly superior to the second best breed Guernsey as they are not significantly different.

Q3.

a.

# Getting data for only mature cows:
butterfat.2 <- subset(butterfat, butterfat$Age == "Mature")

# Creating boxplot of the data for only mature cows:
plot(Butterfat ~ Breed, butterfat.2, xlab = "Breed", ylab = "Butterfat",
     main = "Butterfat Content of Milk from 5 Different Breeds of Mature Cows")

# Creating stripchart of the data of the data for only mature cows:
stripchart(Butterfat ~ Breed, butterfat.2, vertical = TRUE, method = "stack",
           xlab = "Breed", ylab = "Butterfat",
           main = "Butterfat Content of Milk from 5 Different Breeds of Mature Cows")

Looking at the boxplots of the different breeds of mature cows, the Canadian breed has 2 outliers and the Holstein-Fresian has 1 outlier, whereas the other breeds (Ayrshire, Guernsey, and Jersey) have no outliers. The median for the Canadian breed is close to its upper quartile and the medians for the breeds Holstein-Fresian and Jersey are close to its lower quartile. There is positive skewness for the breeds Holstein-Fresian and Jersey, and somewhat for the breed Ayrshire. In addition, there is negative skewness for the breed Canadian. This suggests that there is a lack of normality in the data. Also, the equality of variance between the breeds does not seem to be satisfied. Hence, all together this suggests a transformation of the response might be needed. Ayrshire, Canadian, and Holstein-Fresian all have similar variation, whereas Guernsey and Jersey have similar variation.

When looking at the stripchart there are ties in the data for the breeds Jersey, and Holstein-Fresian.

b.

# Fitting linear model:
butterfat.lm.2 <- lm(Butterfat ~ Breed, data = butterfat.2)
summary(butterfat.lm.2) # Get results
## 
## Call:
## lm(formula = Butterfat ~ Breed, data = butterfat.2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6520 -0.2720 -0.0430  0.2005  1.0980 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             4.1540     0.1138  36.493  < 2e-16 ***
## BreedCanadian           0.2350     0.1610   1.460  0.15128    
## BreedGuernsey           0.8470     0.1610   5.262 3.83e-06 ***
## BreedHolstein-Fresian  -0.4780     0.1610  -2.969  0.00477 ** 
## BreedJersey             1.2980     0.1610   8.063 2.79e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.36 on 45 degrees of freedom
## Multiple R-squared:  0.7709, Adjusted R-squared:  0.7506 
## F-statistic: 37.86 on 4 and 45 DF,  p-value: 7.284e-14
# ANOVA Test for a significant difference between the breeds (levels):
anova(butterfat.lm.2) # same as anova(lm(Butterfat ~ Breed, data = butterfat.2))
## Analysis of Variance Table
## 
## Response: Butterfat
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## Breed      4 19.6240  4.9060  37.864 7.284e-14 ***
## Residuals 45  5.8306  0.1296                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value obtained from the ANOVA test is 7.284e-14 (small p-value), which is less than the 5% significance level (alpha = 0.05) so I reject the null hypothesis that all breeds have equal mean butterfat and conclude that there is a significant difference between the breeds.

c.

# Creating normal Q-Q plot of residuals:
qqnorm(residuals(butterfat.lm.2))
qqline(residuals(butterfat.lm.2)) # Adding Q-Q line

# Creating residuals vs. fitted values plot:
plot(jitter(fitted(butterfat.lm.2)), residuals(butterfat.lm.2), xlab = "Fitted",
     ylab = "Residuals", main = "Residuals vs. Fitted Values")
abline(h = 0) # Adding horizontal line at 0

# Levene's test:
median.butterfat <- with(butterfat.2, tapply(Butterfat, Breed, median)) # Computing medians
# Computing absolute values of residuals
ar.butterfat <- with(butterfat.2, abs(Butterfat - median.butterfat[Breed]))
anova(lm(ar.butterfat ~ Breed, butterfat.2)) # Computing ANOVA test
## Analysis of Variance Table
## 
## Response: ar.butterfat
##           Df  Sum Sq  Mean Sq F value Pr(>F)
## Breed      4 0.26537 0.066343  0.9548 0.4415
## Residuals 45 3.12679 0.069484
# Bartlett test:
bartlett.test(Butterfat ~ Breed, butterfat.2)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  Butterfat by Breed
## Bartlett's K-squared = 6.6257, df = 4, p-value = 0.157
# Shapiro-Wilk normality test if the residuals are normal:
shapiro.test(residuals(butterfat.lm.2))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(butterfat.lm.2)
## W = 0.95549, p-value = 0.05757
# Durbin-Watson test for correlated errors:
dwtest(Butterfat ~ Breed, data = butterfat.2)
## 
##  Durbin-Watson test
## 
## data:  Butterfat ~ Breed
## DW = 1.7725, p-value = 0.08027
## alternative hypothesis: true autocorrelation is greater than 0

When looking at the normal Q-Q plot, most of the residuals approximately lie on the Q-Q line so therefore the residuals are approximately normal. In addition, in the residuals vs. fitted values plot the residuals seem to approximately form a horizontal band around the horizontal line at 0, which suggests that the error variances are homogeneous to one another.

P-value from the Levene’s test is 0.4415 (large p-value), which is greater than the 1% significance level (alpha = 0.01) so I conclude that there is no evidence of a non-constant variance (for the errors). This is also confirmed by the p-value from the Bartlett test as the p-value is large (p-value = 0.157). Hence, the assumption of constant variance is satisfied.

Also, the p-value from the Shapiro-Wilk normality test is 0.05757 (large p-value), which is greater than the 5% significance level (alpha = 0.05) so therefore I do not reject the null hypothesis that the residuals are normal and conclude that there is strong evidence suggesting the residuals are normal. Hence, the normality assumption has been met.

There is no symmetry in the residuals vs. fitted values plot, which suggests that the residuals are independent of one another. The p-value obtained from the Durbin-Watson test is 0.08027 (large p-value), which is greater than the 5% significance level (alpha = 0.05) so therefore I do not reject the null hypothesis that the errors are uncorrelated and conclude that the there is strong evidence of the errors being uncorrelated. Hence, the independence assumption is satisfied.

All together this suggests that a transformation of the response is not needed.

d.

# Computing Tukey's HSD 95% Confidence Intervals:
tukey.CI <- TukeyHSD(aov(Butterfat ~ Breed, data = butterfat.2))
tukey.CI # Get results
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Butterfat ~ Breed, data = butterfat.2)
## 
## $Breed
##                             diff          lwr         upr     p adj
## Canadian-Ayrshire          0.235 -0.222410494  0.69241049 0.5932057
## Guernsey-Ayrshire          0.847  0.389589506  1.30441049 0.0000366
## Holstein-Fresian-Ayrshire -0.478 -0.935410494 -0.02058951 0.0365566
## Jersey-Ayrshire            1.298  0.840589506  1.75541049 0.0000000
## Guernsey-Canadian          0.612  0.154589506  1.06941049 0.0037519
## Holstein-Fresian-Canadian -0.713 -1.170410494 -0.25558951 0.0005502
## Jersey-Canadian            1.063  0.605589506  1.52041049 0.0000004
## Holstein-Fresian-Guernsey -1.325 -1.782410494 -0.86758951 0.0000000
## Jersey-Guernsey            0.451 -0.006410494  0.90841049 0.0549908
## Jersey-Holstein-Fresian    1.776  1.318589506  2.23341049 0.0000000
# Tukey's HSD 95% Confidence Intervals Plot:
plot(tukey.CI)

The pairwise differences that are not statistically significant are Canadian and Ayrshire, and Jersey and Guernsey because their corresponding confidence intervals contain 0. Also, this is because their corresponding p-values (p-values are large) are greater than the 5% significance level (alpha = 0.05). The rest of the pairwise difference are statistically significant, which are Guernsey and Ayrshire, Holstein-Fresian and Ayrshire, Jersey and Ayrshire, Guernsey and Canadian, Holstein-Fresian and Canadian, Jersey and Canadian, Holstein-Fresian and Guernsey, and Jersey and Holstein-Fresian.

Q4.

a.

head(fortune) # To get a view of the data
##   wealth age region
## 1   37.0  50      M
## 2   24.0  88      U
## 3   14.0  64      A
## 4   13.0  63      U
## 5   13.0  66      U
## 6   11.7  72      E
levels(fortune$region) # To see the levels of region
## [1] "A" "E" "M" "O" "U"
fortune.2 <- na.omit(fortune) # Omitting observations that have NA values

# Plotting the wealth as a function of age:
plot(wealth ~ age, fortune.2, pch = unclass(region), xlab = "Age",
     ylab = "Wealth (billions of $)",
     main = "Wealth for Billionaires with different Ages in 5 Regions of the World")
legend(4, 38, levels(fortune.2$region), pch = 1:5) # Adding a legend for the different regions

b.

# Plotting the wealth as a function of age with a separate panel for each region:
ggplot(aes(x = age, y = wealth), data = fortune.2) +
  geom_point() +
  facet_wrap( ~ region) +
  xlab("Age") +
  ylab("Wealth (billions of $)") +
  ggtitle("Wealth for Billionaires with different Ages for each of the 5 Regions of the World")

c.

# Fitting linear model:
fortune.lm <- lm(wealth ~ age*region, data = fortune.2)
summary(fortune.lm) # Get results
## 
## Call:
## lm(formula = wealth ~ age * region, data = fortune.2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.737 -1.243 -0.736  0.499 32.357 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2.721453   3.626188   0.750    0.454
## age         -0.001101   0.056305  -0.020    0.984
## regionE     -0.920828   4.020305  -0.229    0.819
## regionM      3.255513   4.415284   0.737    0.462
## regionO      0.526521   4.861590   0.108    0.914
## regionU     -2.894546   4.329271  -0.669    0.504
## age:regionE  0.008243   0.062202   0.133    0.895
## age:regionM -0.025575   0.067704  -0.378    0.706
## age:regionO -0.014012   0.074971  -0.187    0.852
## age:regionU  0.050569   0.066972   0.755    0.451
## 
## Residual standard error: 3.364 on 215 degrees of freedom
## Multiple R-squared:  0.04241,    Adjusted R-squared:  0.002325 
## F-statistic: 1.058 on 9 and 215 DF,  p-value: 0.3953
# Creating residuals vs. fitted values plot:
plot(residuals(fortune.lm) ~ fitted(fortune.lm), pch = unclass(fortune.2$region),
     xlab = "Fitted Values", ylab = "Residuals", main = "Residuals vs. Fitted Values")
abline(h = 0) # Adding horizontal line at 0
legend(0.825, 33, levels(fortune.2$region), pch = 1:5) # Adding a legend for the different regions

# Creating normal Q-Q plot of residuals:
qqnorm(residuals(fortune.lm))
qqline(residuals(fortune.lm)) # Adding Q-Q line

# Looking at the plots there is non-constant variance that is unrelated to the 5 regions.
# Hence, there is heteroscedasticity in the linear model.
# Therefore, I will use a log transformation to remove the heteroscedasticity.

# Fitting new linear model with log transformation of the response:
fortune.lm2 <- lm(log(wealth) ~ age*region, data = fortune.2)
summary(fortune.lm2) # Get results
## 
## Call:
## lm(formula = log(wealth) ~ age * region, data = fortune.2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8948 -0.4375 -0.1967  0.3705  2.8901 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.918241   0.672231   1.366    0.173
## age         -0.001865   0.010438  -0.179    0.858
## regionE     -0.366568   0.745293  -0.492    0.623
## regionM     -0.715188   0.818515  -0.874    0.383
## regionO      0.040727   0.901252   0.045    0.964
## regionU     -0.232411   0.802570  -0.290    0.772
## age:regionE  0.003485   0.011531   0.302    0.763
## age:regionM  0.012220   0.012551   0.974    0.331
## age:regionO -0.002243   0.013898  -0.161    0.872
## age:regionU  0.003242   0.012415   0.261    0.794
## 
## Residual standard error: 0.6237 on 215 degrees of freedom
## Multiple R-squared:  0.0253, Adjusted R-squared:  -0.0155 
## F-statistic:  0.62 on 9 and 215 DF,  p-value: 0.7793
# Creating residuals vs. fitted values plot:
plot(residuals(fortune.lm2) ~ fitted(fortune.lm2), pch = unclass(fortune.2$region),
     xlab = "Fitted Values", ylab = "Residuals", main = "Residuals vs. Fitted Values")
abline(h = 0) # Adding horizontal line at 0
legend(0.27, 2.9, levels(fortune.2$region), pch = 1:5) # Adding a legend for the different regions

# Creating normal Q-Q plot of residuals:
qqnorm(residuals(fortune.lm2))
qqline(residuals(fortune.lm2)) # Adding Q-Q line

A log transformation on the response can be used to facilitate linear modeling, as the log transformation removed most of the heteroscedasticity from looking at the Q-Q plot and residuals vs. fitted values plot.

d.

summary(fortune.lm2) # Get results of linear model with log transformation of the response
## 
## Call:
## lm(formula = log(wealth) ~ age * region, data = fortune.2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8948 -0.4375 -0.1967  0.3705  2.8901 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.918241   0.672231   1.366    0.173
## age         -0.001865   0.010438  -0.179    0.858
## regionE     -0.366568   0.745293  -0.492    0.623
## regionM     -0.715188   0.818515  -0.874    0.383
## regionO      0.040727   0.901252   0.045    0.964
## regionU     -0.232411   0.802570  -0.290    0.772
## age:regionE  0.003485   0.011531   0.302    0.763
## age:regionM  0.012220   0.012551   0.974    0.331
## age:regionO -0.002243   0.013898  -0.161    0.872
## age:regionU  0.003242   0.012415   0.261    0.794
## 
## Residual standard error: 0.6237 on 215 degrees of freedom
## Multiple R-squared:  0.0253, Adjusted R-squared:  -0.0155 
## F-statistic:  0.62 on 9 and 215 DF,  p-value: 0.7793
# Centering the Age Predictor by its mean value:
fortune.2$center.age <- fortune.2$age - mean(fortune.2$age)

# Fitting new linear model with Age centered by its mean value:
fortune.lm2.center <- lm(log(wealth) ~ center.age*region, data = fortune.2)
summary(fortune.lm2.center) # Get results
## 
## Call:
## lm(formula = log(wealth) ~ center.age * region, data = fortune.2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8948 -0.4375 -0.1967  0.3705  2.8901 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         0.798830   0.102614   7.785 2.91e-13 ***
## center.age         -0.001865   0.010438  -0.179    0.858    
## regionE            -0.143424   0.125092  -1.147    0.253    
## regionM             0.067290   0.167969   0.401    0.689    
## regionO            -0.102863   0.156281  -0.658    0.511    
## regionU            -0.024833   0.129632  -0.192    0.848    
## center.age:regionE  0.003485   0.011531   0.302    0.763    
## center.age:regionM  0.012220   0.012551   0.974    0.331    
## center.age:regionO -0.002243   0.013898  -0.161    0.872    
## center.age:regionU  0.003242   0.012415   0.261    0.794    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6237 on 215 degrees of freedom
## Multiple R-squared:  0.0253, Adjusted R-squared:  -0.0155 
## F-statistic:  0.62 on 9 and 215 DF,  p-value: 0.7793

The Relationship of Age to Wealth (for each of the 5 regions): For a 1-unit increase in age (1-year older) there is an estimated average decrease of -0.001865 in wealth for region A (reference level).

For a 1-unit increase in age (1-year older) there is an estimated average increase of 0.00162 (-0.001865 + 0.003485 = 0.00162) in wealth for region E.

For a 1-unit increase in age (1-year older) there is an estimated average increase of 0.010355 (-0.001865 + 0.012220 = 0.010355) in wealth for region M.

For a 1-unit increase in age (1-year older) there is an estimated average decrease of 0.004108 (-0.001865 - 0.002243 = -0.004108) in wealth for region O.

For a 1-unit increase in age (1-year older) there is an estimated average increase of 0.001377 (-0.001865 + 0.003242 = 0.001377) in wealth for region U.

The Relationship of Region to Wealth (for average age): The estimated average wealth for region A at average age is 0.798830.

The estimated average wealth for region E at average age decreases by $0.143424 billion and is 0.655406 billions of dollars (0.798830 - 0.143424 = 0.655406).

The estimated average wealth for region M at average age increases by $0.067290 billion and is 0.86612 billions of dollars (0.798830 + 0.067290 = 0.86612).

The estimated average wealth for region O at average age decreases by $0.102863 billion and is 0.695967 billions of dollars (0.798830 - 0.102863 = 0.695967).

The estimated average wealth for region U at average age decreases by $0.024833 billion and is 0.773997 billions of dollars (0.798830 - 0.024833 = 0.773997).

e.

# Creating normal Q-Q plot of residuals:
qqnorm(residuals(fortune.lm2))
qqline(residuals(fortune.lm2)) # Adding Q-Q line

# Creating residuals vs. fitted values plot:
plot(residuals(fortune.lm2) ~ fitted(fortune.lm2), pch = unclass(fortune.2$region),
     xlab = "Fitted Values", ylab = "Residuals", main = "Residuals vs. Fitted Values")
abline(h = 0) # Adding horizontal line at 0
legend(0.27, 2.9, levels(fortune.2$region), pch = 1:5) # Adding a legend for the different regions

# Shapiro-Wilk normality test if the residuals are normal:
shapiro.test(residuals(fortune.lm2))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(fortune.lm2)
## W = 0.89856, p-value = 3.374e-11
# Test for a difference in the variance:
fortune.2$residuals <- sqrt(abs(residuals(fortune.lm2))) # Square-root of residuals
fortune.lm.var <- lm(residuals ~ age*region, data = fortune.2) # Creating new linear model
anova(fortune.lm.var) # ANOVA test
## Analysis of Variance Table
## 
## Response: residuals
##             Df  Sum Sq  Mean Sq F value Pr(>F)
## age          1  0.0394 0.039375  0.5988 0.4399
## region       4  0.3843 0.096063  1.4609 0.2152
## age:region   4  0.1603 0.040069  0.6094 0.6563
## Residuals  215 14.1374 0.065755
# Durbin-Watson test for correlated errors:
dwtest(log(wealth) ~ age*region, data = fortune.2)
## 
##  Durbin-Watson test
## 
## data:  log(wealth) ~ age * region
## DW = 0.051283, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is greater than 0

When looking at the normal Q-Q plot, a lot of the residuals fall off the Q-Q line so therefore the residuals don’t seem to be normal. Near the beginning of the plot and the end of the plot a lot of the residuals fall off the Q-Q line, which suggests the errors are long-tailed. Also, the p-value from the Shapiro-Wilk normality test is 3.374e-11 (small p-value), which is less than the 5% significance level (alpha = 0.05) so therefore I reject the null hypothesis that the residuals are normal and conclude that there is no strong evidence suggesting the residuals are normal. Hence, the normality assumption has not been satisfied.

In addition, in the residuals vs. fitted values plot the residuals seem to be evenly spread out around the horizontal line at 0 and form a horizontal band around the horizontal line at 0, which suggests that the error variances are homogeneous to one another. In addition, the p-value from the ANOVA test for age is 0.4399, the p-value for region is 0.2152, and the p-value for the interaction term between age and region is 0.6563. All of these p-values are large, which allows for the conclusion that there is no strong evidence against of constant variance for both age, region, and the interaction term between age and region. Hence, the assumption of constant variance is satisfied.

There is no symmetry in the residuals vs. fitted values plot, which suggests that the residuals are independent of one another. However, the p-value obtained from the Durbin-Watson test is <2.2e-16 (small p-value), which is less than the 5% significance level (alpha = 0.05) so therefore I do reject the null hypothesis that the errors are uncorrelated and conclude that the there is no strong evidence of the errors being uncorrelated. Hence, the independence assumption is not satisfied.

Hence, maybe another transformation of the response should be used for the linear model to satisfy the normality and independence assumptions.